Abstract

Whether one should use null hypothesis testing, confidence intervals, and/or effect sizes 
is a source of continuing controversy in educational research. An alternative to testing for 
statistical significance, known as equivalence testing, is little used in educational 
research. Equivalence testing is useful in situations where the researcher wishes to show 
that two means are not significantly different. A common equivalence test for comparing 
the means of two independent samples is reviewed. A simulation study assessed the 
relationships between effect size, sample size, statistical significance, and statistical 
equivalence. An example of typical educational research data is reanalyzed using 
equivalence methodology. A tentative conclusion about the magnitude of effect size 
needed to be important is drawn.

The Use of Equivalence Testing in Conjunction with Standard
Hypothesis Testing and Effect Sizes 

The use of statistical inference, particularly via null hypothesis significance testing, 
is an extremely common but contentious practice in educational research. Both the pros 
and the cons of hypothesis testing have been argued in the literature for several decades. 
Some support the continued usage of significance testing (Abelson, 1997; Hagan, 1997; 
Harris, 1997; McLean & Ernest, 1998), others desire a greater reliance on alternatives 
such as confidence intervals or effect sizes (Cohen, 1992, 1994; Knapp, 1998; 
Thompson, 1998a, 1998b; Vacha-Haase, 2001), and still others advocate an outright ban 
on significance testing (Carver, 1993; Nix & Barnette, 1998; Schmidt & Hunter, 1997). 
The references included here are by no means close to being an exhaustive list. This 
debate is not limited to our research community; for instance, it is also being argued in 
ecology (McBride, 1999; Anderson, Burnham, & Thompson, 2000). Many in the statistical 
community outside of our niche of educational and psychological research, though, are 
either unaware of this debate or feel that it is trivial (Krantz, 1999). 

The objective of this paper is not to continue this heated argument, but rather to
borrow the method of equivalence testing from biostatistics, as suggested by Bartko
(1991), and to use it in conjunction with standard hypothesis testing in educational
research. Lehmann (1959) anticipated the need for interval testing in his classic volume on
the theory of hypothesis testing. Many of the currently employed methods of equivalence
testing were developed in the 1970s and 1980s to address biostatistical and
pharmaceutical problems (Westlake, 1976, 1979; Schuirmann, 1981; Anderson & Hauck,
1983; Patel & Gupta, 1984; Schuirmann, 1987). Rogers, Howard, and Vessey (1993)
introduced the use of equivalence testing methods to the social sciences. Serlin (1993)
essentially proposed equivalence testing when he suggested the use of “range”, rather
than “point”, null hypotheses.

Background

Standard null hypothesis significance testing dates back to the pioneering 
theoretical work of Fisher, Neyman, and Pearson. Hypothesis testing can be found in 
almost every textbook of statistical methods and thus will not be further elaborated on 
here. Equivalence testing, on the other hand, is a newer technique and one that is 
unfamiliar to most researchers in education and the social sciences. 

Equivalence testing was developed in biostatistics to address the situation where 
the goal is not to show that the mean of one group is greater than the mean of another 
group (i.e. the superiority of one treatment to another), but rather to establish that two 
methods are equal to one another. A common application of this idea in biostatistics is to 
show that a less expensive “generic” medication is as effective as the more expensive
“brand-name” medication. In equivalence testing, the null hypothesis is that the two 
groups are not equivalent to one another, and hence rejection of the null indicates that the 
two groups are equivalent. This differs from standard significance testing where the null 
hypothesis states that the group means are equal and rejection of the null indicates that 
the two groups are statistically different. A common methodological mistake in research is 
to conclude that the null hypothesis is true (i.e., two groups have equal means) based on
the failure to reject it. This action fails to recognize that the failure to reject $H_0$ is often
merely a Type II error, especially when the sample sizes are small and the power of the 
test is low. 

An explanation of the theory of equivalence testing can be found in Berger and Hsu
(1996). Here, we will merely review the most commonly implemented method used for
establishing the equivalence of two population means for an additive model, where the
difference of means is considered. The multiplicative model, which looks at the ratio of
means, will not be considered further in this paper. The commonly used procedure in
biostatistics for this problem is the “two one-sided tests” procedure, or
TOST (Westlake, 1976, 1979; Schuirmann, 1981, 1987). With the TOST, the researcher
will consider two groups equivalent if it can be shown that they differ by less than some
constant $\tau$, the equivalence bound, in both directions. The constant $\tau$ is often chosen to
be a percentage (such as 10% or 20%) of the mean of the control group, although $\tau$ can
also be chosen to be the smallest absolute difference between two means that is large
enough to be practically important.

The null hypothesis (i.e. the means are different) for the TOST is

$$H_0 : |\mu_1 - \mu_2| \ge \tau$$

or

$$H_0 : \mu_1 - \mu_2 \ge \tau \quad \text{or} \quad \mu_1 - \mu_2 \le -\tau$$

The alternative hypothesis (i.e. the means are equivalent) is

$$H_1 : |\mu_1 - \mu_2| < \tau$$

or

$$H_1 : -\tau < \mu_1 - \mu_2 < \tau$$

The first one-sided test seeks to reject the null hypothesis that the difference
between two means is less than or equal to $-\tau$; similarly, the second one-sided test
seeks to reject the null hypothesis that the difference in the means is greater than or
equal to $\tau$. If the one-sided test with the larger p-value leads to rejection, then the two
groups are considered to be equivalent.

For the first one-sided test, we compute the test statistic

$$t_1 = \frac{\bar{x}_1 - \bar{x}_2 + \tau \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$$

where $s_p$ is the pooled standard deviation of the two samples, and compute the p-value as

$$p_1 = P(t_\nu > t_1)$$

where $t_\nu$ is a random variable from the t distribution with $\nu = n_1 + n_2 - 2$ degrees of
freedom. Here the bound enters as $\tau \bar{x}_2$, that is, with $\tau$ expressed as a proportion of
the mean of the second (control) group.

The second one-sided test is similar to the first. The test statistic is

$$t_2 = \frac{\bar{x}_1 - \bar{x}_2 - \tau \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$$

and the p-value is

$$p_2 = P(t_\nu < t_2)$$

If we let $p = \max(p_1, p_2)$, then the null hypothesis of nonequivalence is rejected if $p < \alpha$.
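
The two one-sided tests can be sketched in Python from summary statistics alone. This is a minimal illustration rather than the implementation used for this paper: the function names `t_cdf` and `tost_p_value` are hypothetical, the t distribution function is approximated by Simpson's rule so that only the standard library is needed, and the bound is taken as a proportion of the second group's mean, as in the test statistics above. The example values are the means, standard deviations, and group sizes listed for two variables in Table 2.

```python
import math


def t_cdf(x, df, steps=2000):
    """Approximate the Student's t CDF by Simpson's rule (stdlib only)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(x) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |x|
    # Clamp to [0, 1] to absorb small numerical error in the far tails.
    return min(1.0, 0.5 + area) if x >= 0 else max(0.0, 0.5 - area)


def tost_p_value(m1, s1, n1, m2, s2, n2, tau):
    """Return p = max(p1, p2) for the TOST, with bound tau * m2 (assumption)."""
    df = n1 + n2 - 2
    s_p = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    se = s_p * math.sqrt(1 / n1 + 1 / n2)
    bound = tau * m2
    t1 = (m1 - m2 + bound) / se
    t2 = (m1 - m2 - bound) / se
    p1 = 1.0 - t_cdf(t1, df)  # P(t_nu > t1)
    p2 = t_cdf(t2, df)        # P(t_nu < t2)
    return max(p1, p2)


if __name__ == "__main__":
    # Males (n = 94) versus females (n = 123), tau = 0.2, values from Table 2.
    print(tost_p_value(3.45, 2.14, 94, 2.20, 2.01, 123, 0.2))   # prior math courses
    print(tost_p_value(25.77, 5.96, 94, 23.20, 7.05, 123, 0.2))  # math self-concept
```

Under these assumptions the two calls should land close to the equivalence p-values that Table 2 lists for prior math courses and math self-concept (0.998 and 0.012); small differences can arise from rounding in the published summary statistics.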

The choice of $\tau$ is a difficult one that is left to the researcher. This choice is
analogous to the selection of an appropriate alpha level in standard significance testing, 
an appropriate level of confidence in interval estimation, or a sufficiently large effect size, 
and should be made carefully. Knowledge of the situation at hand should be used to 
specify the maximum difference between population means that would be considered 
clinically trivial. Researchers in biostatistics typically have the choice made for them by 
government regulation. 

As in standard hypothesis testing, an equivalency confidence interval can also be
constructed. If the entire confidence interval is within $(-\tau, \tau)$, then equivalence between
the groups is indicated. If the entire confidence interval is within either $(-\tau, 0)$ or $(0, \tau)$ (i.e.
zero is not in the interval), then we would reject the null hypotheses of both a significance
and an equivalence test. In that case, we could make the somewhat discomforting
conclusion that the difference of means was both statistically significant and equivalent.
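
In the notation of the TOST statistics above, the interval can be sketched as follows. This assumes the usual pooled-variance form, writes the bound simply as $\tau$, and uses $t_{1-\alpha,\nu}$ for the $1-\alpha$ quantile of the t distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom; it is an illustration, not a formula quoted from the sources cited here.

```latex
\[
(L,\; U) = (\bar{x}_1 - \bar{x}_2) \mp t_{1-\alpha,\nu}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad \text{conclude equivalence if } -\tau < L \text{ and } U < \tau .
\]
```

Requiring $-\tau < L$ is the same as rejecting the first one-sided test at level $\alpha$, and requiring $U < \tau$ is the same as rejecting the second, which is why the interval carries $100(1 - 2\alpha)\%$ rather than $100(1 - \alpha)\%$ confidence.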

It is important to note that the equivalency confidence interval is expressed at the
$100(1 - 2\alpha)\%$ level of confidence. Rogers et al. (1993) noted that if one performs both a
standard significance test and an equivalence test on the same data set, making either a
“reject” or “fail to reject” decision in each, there are four possible outcomes. These four conditions
are given in Table 1.

Insert Table 1 about here



The second condition, “equivalent and different”, a simultaneous rejection of both
inferential procedures, could happen in a situation where large samples provide “too 
much power”, resulting in a trivial difference in means being statistically significant. The 
equivalence test (and the effect size) should detect the small magnitude of these mean 
differences. The fourth condition indicates that there is insufficient evidence to conclude 
that the groups are either equivalent or different. This would most likely occur when the 
samples are very small and/or the group variances are very large. 



Effect Size Measures for Difference of Means 



The effect size for the difference of means is the standardized difference between
the groups (Fan, 2001). We will use the parameter

$$\delta = \frac{\mu_1 - \mu_2}{\sigma}$$

to represent the effect size of the population, where $\mu_1$ and $\mu_2$ are the population means
and $\sigma^2$ is the common variance.

Of course, $\delta$ is typically unknown and needs to be estimated. Cohen’s d (1988) is a
statistic often used for this purpose. The effect size (ES) is found with

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$

where

$$s_{pooled} = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$$

is the pooled standard deviation of the two samples. We stress that Cohen’s d is a
sample statistic and that d has a sampling distribution like other estimates (e.g. $\bar{x}$). Also,
Cohen’s d is biased for $\delta$ (i.e. $E(d) \ne \delta$). A modification due to Hedges (1981, 1982) is
unbiased for $\delta$.
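
As a worked illustration, the effect size can be computed directly from group summaries. The sketch below is not taken from the paper: the function names are hypothetical, and the small-sample correction uses the commonly quoted approximation $1 - 3/(4\nu - 1)$ with $\nu = n_1 + n_2 - 2$, which is an assumption about the form of the Hedges adjustment rather than a formula given here. The inputs are the GPA and self-efficacy rows of Table 2.

```python
import math


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


def hedges_g(m1, s1, n1, m2, s2, n2):
    """Bias-adjusted effect size using the approximate correction factor."""
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)  # assumed approximation, see lead-in
    return correction * cohens_d(m1, s1, n1, m2, s2, n2)


if __name__ == "__main__":
    # Males (n = 94) versus females (n = 123), summaries from Table 2.
    print(round(cohens_d(3.05, 0.44, 94, 3.16, 0.47, 123), 2))    # GPA
    print(round(cohens_d(12.68, 1.77, 94, 11.62, 2.30, 123), 2))  # self-efficacy
```

With groups of this size the correction factor is very close to one, so d and the adjusted value differ only in the third decimal place.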

Cohen (1988) gave some suggestions for interpreting d. An effect size of d = 0.20 is 
deemed “small”, d = 0.50 is “medium”, and d = 0.80 is “large”. It is becoming, rather 
regrettably in our opinion, common for researchers to rigidly apply Cohen's suggestions. 
Absolute reliance on Cohen’s rule of thumb is as misguided as blind adherence to a 
particular level of significance (e.g. $\alpha = 0.05$). As Thompson (2001) said, “we would
merely be being stupid in another metric”.

Typical Example of Educational Data 

Rogers et al. (1993) provided empirical examples of the application of equivalence 
testing on data from the psychological literature. Here, we will do the same with an 
example from the educational research literature. This will demonstrate that there often 
exist situations where a statistically significant difference between groups coincides with 
the groups being statistically equivalent. This is the “equivalent and different” condition 
that is typically associated with a small to moderate effect size, as opposed to the strong 
effect sizes that typically occur with the “different” condition and the weak effect sizes that 
occur with the “equivalent” condition. 

Benson (1989), in a study concerning statistical test anxiety, presented means and 
variances for a sample of 94 males and 123 females on seven variables. Using standard 
hypothesis testing methods (i.e. t-tests), significant group differences were found for: 
prior math courses, math self-concept, self-efficacy, and statistical test anxiety. However, 
after calculating Cohen’s d,

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

as an effect size (ES) measure and applying the TOST equivalence test, we see that
only prior math courses and statistical test anxiety are “different” between males and
females. Not surprisingly, the two largest effect sizes are found for these two variables.
Table 2 shows results of both traditional significance and equivalence tests for the Benson
data.

Statistical significance was defined as a rejection of $H_0$ with $\alpha = 0.05$ and
equivalence was defined as a rejection of $H_0$ with $\alpha = 0.10$. The reason for the two
different significance levels is that while a traditional significance test at level $\alpha$
corresponds to a $100(1 - \alpha)\%$ confidence interval, an equivalence test at level $\alpha$
corresponds to a $100(1 - 2\alpha)\%$ equivalence interval. We selected $\tau = 0.2$ (i.e. 20% of the
mean of the female group). This choice was arbitrary and by no means should be taken
as a choice recommended for all equivalence problems. The results could differ with
different choices for $\tau$.



Insert Table 2 about here 



The Power of Significance and Equivalence Tests 

The power function for the independent samples t-test is well known. For a test of
statistical significance, power is the probability of rejecting the null hypothesis that the
population means are equal when they are in fact not equal. Assuming equal variances,
the power $K_{sig}$ for the two-sided alternative $H_a : \mu_1 - \mu_2 \ne 0$ is given by (SAS Institute,
1998; O’Brien & Lohr, 1984):

$$K_{sig} = P(t < t_{\alpha/2,\nu} \mid \nu, NC) + 1 - P(t < t_{1-\alpha/2,\nu} \mid \nu, NC)$$

where $\nu = n_1 + n_2 - 2$ are the degrees of freedom, $t_{p,\nu}$ is the $p$ quantile of the (central)
t-distribution with $\nu$ df, and $NC = (\mu_1 - \mu_2)/(\sigma\sqrt{1/n_1 + 1/n_2})$ is the noncentrality parameter.



The power of an equivalence test is the probability of rejecting the hypothesis that the means
differ by at least some equivalence bound $\tau$ when the means are in fact equivalent (i.e.
differ by less than $\tau$). The computation of this power, $K_{equiv}$, requires use of the bivariate
non-central t-distribution (Phillips, 1990):

$$K_{equiv} = P(t_1 \ge t_{1-\alpha,\nu} \text{ and } t_2 \le -t_{1-\alpha,\nu})$$

where $(t_1, t_2)$ has a bivariate non-central t-distribution. Owen (1965) showed that
probabilities from the bivariate non-central t-distribution could be computed as the
difference of two definite integrals that are known as Owen’s Q functions.

Statistical software packages, such as SAS/GRAPH (SAS Institute, 1998), can be 
used to graph the power functions of both the tests of statistical significance and 
equivalence. Figure 1 shows the power of the independent samples t-test and the TOST
for various effect sizes with a sample size of n = 200 per group and an equivalence bound
of $\tau = 0.2$. Figure 2 is a similar graph, but with $\tau = 0.4$.



Insert Figure 1 about here 



Insert Figure 2 about here 



Figure 3 considers the power of the independent samples t-test and the TOST for
sample sizes per group ranging from 10 to 500 with fixed effect size $\delta = 0.2$ and fixed
equivalence bound $\tau = 0.2$. Figure 4 makes a similar comparison, with $\delta = \tau = 0.4$.
Figure 5 uses $\delta = 0.2$ and $\tau = 0.4$.



Insert Figure 3 about here



Insert Figure 4 about here



Insert Figure 5 about here 



Simulation Study 

Of interest to us is the probability of rejecting both null hypotheses (of
non-significance and non-equivalence) simultaneously. We designed a small simulation
study to assess the power of simultaneously concluding that two means are both 
statistically different and equivalent. 

As is always the case with Monte Carlo studies, the choices of simulation 
parameters are difficult to make and are somewhat arbitrary. We endeavored to simulate 
situations that were likely to be encountered in actual quantitative data analysis. We also 
made some simplifying assumptions to keep the number of simulations and associated 
tables and figures to a reasonable level. 

We assumed that both of our populations were always normally distributed with a
common variance $\sigma^2 = 1$. Six different sample sizes per group
(n = 10, 20, 50, 100, 200, 500) were chosen; only equally sized groups were used in this
study. Six different values for the effect size parameter ($\delta$ = 0, 0.1, 0.2, 0.3, 0.4, 0.5) were
used, reflecting situations from no effect (i.e. equivalent population means) to a “medium”
effect size (i.e. population means that differ by one half of a standard deviation). Three
different equivalence bounds ($\tau$ = 0.1, 0.2, 0.4) were used, defining the minimum
difference between means that is practically important (i.e. non-equivalent) to be either
10%, 20% or 40% of $\mu_1$.

Hence, we have a fully crossed design with 6 x 6 x 3 = 108 cells. Within each cell
(i.e. combination of sample size, effect size, and equivalence bound), 10000 simulations
were run. The R statistical computing environment was used to conduct the simulations.
Each simulation consisted of generating n random normal variates with mean $0 + \delta$ and
variance 1 and a second, independent set of n random normal variates with mean 0 and
variance 1. The independent samples t-test and the TOST with equivalence bound $\tau$ were
conducted for each simulation, and the number of rejections of each test, along with the
number of simultaneous rejections of both procedures and the number of failures to reject
either procedure, were noted.
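
One cell of such a design can be sketched as follows. This is a Python illustration, not the R code used for the study, and several details are assumptions: the function names are hypothetical, the bound $\tau$ is applied directly on the raw scale (the second population has mean zero), the significance test uses $\alpha = 0.05$ and the TOST uses $\alpha = 0.10$ as in the Benson re-analysis, and critical values come from a numerically inverted t distribution so that only the standard library is required.

```python
import math
import random


def t_cdf(x, df, steps=2000):
    """Approximate the Student's t CDF by Simpson's rule (stdlib only)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(x) / steps
    total = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return min(1.0, 0.5 + area) if x >= 0 else max(0.0, 0.5 - area)


def t_quantile(p, df):
    """Invert t_cdf by bisection; only a few critical values are needed."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def simulate_cell(n, delta, tau, n_sims=10000,
                  alpha_sig=0.05, alpha_eq=0.10, seed=0):
    """Count the four outcomes for one (n, delta, tau) cell."""
    rng = random.Random(seed)
    df = 2 * n - 2
    crit_sig = t_quantile(1 - alpha_sig / 2, df)  # two-sided t-test
    crit_eq = t_quantile(1 - alpha_eq, df)        # each one-sided test
    counts = {"equivalent": 0, "both": 0, "different": 0, "neither": 0}
    for _ in range(n_sims):
        x = [rng.gauss(delta, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mx, my = sum(x) / n, sum(y) / n
        ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
        se = math.sqrt(ss / df) * math.sqrt(2 / n)  # pooled standard error
        diff = mx - my
        significant = abs(diff / se) > crit_sig
        equivalent = ((diff + tau) / se > crit_eq
                      and (diff - tau) / se < -crit_eq)
        if significant and equivalent:
            counts["both"] += 1
        elif equivalent:
            counts["equivalent"] += 1
        elif significant:
            counts["different"] += 1
        else:
            counts["neither"] += 1
    return counts


if __name__ == "__main__":
    print(simulate_cell(n=50, delta=0.2, tau=0.4, n_sims=1000))
```

Because the random number generator and the assumed significance levels differ from those of the original study, counts from this sketch should be expected to resemble the tabulated values only approximately.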

Tables 3 through 8 show the number of rejections of the null hypotheses of the 
equivalence test, both tests, the significance test, and neither test. Columns involving the 
equivalence test are in italics; columns involving the significance test are in bold-face.
Note that the power of the equivalence test for each situation can be found by dividing the 
sum of the italicized columns by 10000. Similarly, the power of the significance test is 
obtained by dividing the sum of the columns in bold-face by 10000. 
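
The arithmetic can be made concrete with one row of the tables. The short Python sketch below uses a hypothetical helper, `powers_from_counts`, and takes its counts from the Table 3 row with an equivalence bound of 0.4 and 200 subjects per group (Equivalent 9493, Both 444, Different 63, Neither 0).

```python
def powers_from_counts(equivalent, both, different, neither):
    """Return (equivalence power, significance power) from one table row."""
    total = equivalent + both + different + neither
    equivalence_power = (equivalent + both) / total   # italicized columns
    significance_power = (both + different) / total   # bold-face columns
    return equivalence_power, significance_power


if __name__ == "__main__":
    eq_power, sig_power = powers_from_counts(9493, 444, 63, 0)
    print(eq_power, sig_power)
```

For this row the equivalence test rejects in (9493 + 444)/10000 = 0.9937 of the simulations and the significance test in (444 + 63)/10000 = 0.0507, the latter being close to the nominal 0.05 level expected when the true effect size is zero.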



Insert Table 3 about here 



Insert Table 4 about here 



Insert Table 5 about here



Insert Table 6 about here



Insert Table 7 about here 



Insert Table 8 about here 



The results of the simulation study were used to obtain graphs of the approximate 
power of rejecting both tests simultaneously, for sample sizes ranging from 0 to 500 and
equivalence bounds $0.1 < \tau < 0.5$. The graphs were generated with the SAS/GRAPH
software package, utilizing the G3GRID and G3D procedures. The G3GRID procedure
interpolated, using the default method of bivariate interpolation (Akima, 1978; SAS
Institute, 1998), the power of simultaneous rejection for combinations of n and $\tau$ that were
not included in the simulation design. The G3D procedure then produced a smoothed
three-dimensional surface graph of the interpolated data set. Figure 6 is the power of
simultaneous rejection with true effect size $\delta = 0$. Figure 7 is a similar graph for a “small”
true effect size $\delta = 0.2$.



Insert Figure 6 about here 



Insert Figure 7 about here

Discussion

The data originally collected and analyzed with traditional significance tests by
Benson (1989) showed a statistically significant difference between the means of male
and female statistics students on six variables (GPA, number of prior math courses, math
self-concept, self-efficacy, general test anxiety, and statistical test anxiety) and failed to
find significance for only one variable (achievement). We computed Cohen’s d as an
effect size. Not surprisingly, the smallest absolute effect size of 0.04 was found for the 
non-significant variable, while the absolute effect sizes of the six significant variables 
ranged from 0.24 to 0.66. 

We then re-analyzed Benson’s data using the TOST procedure for testing for 
statistical equivalence. This analysis showed that only two variables, number of prior 
math courses and statistical test anxiety, were “different” (i.e. significant and not 
equivalent). Not coincidentally, these were the two variables with the strongest absolute 
effect sizes of 0.60 and 0.66. The non-significant variable (achievement) was found to be 
statistically equivalent, and the absolute effect size was virtually zero. Four of the 
variables (GPA, math self-concept, self-efficacy, and general test anxiety) yielded 
conflicting results of “equivalent and different” since they rejected the null hypotheses of 
both the statistical and equivalence tests. It is likely that the difference in the means of 
these four variables, while statistically significant, is trivial. The absolute effect sizes of 
these four variables ranged from 0.24 to 0.51. This encompasses a range of effect sizes
that is often classified as “small” to “medium” (Cohen, 1988), notwithstanding
Lenth’s (2001) warnings against using “canned” effect sizes.

We noticed that whenever the effect size $\delta$ is less than the equivalence bound $\tau$,
the power of the equivalence test approached unity as n increased. This
convergence was slow when $\delta$ was nearly equal to $\tau$. Essentially, if the effect size
parameter is less than the minimum difference that the researcher considers to be
practically important (i.e. the minimum difference between means large enough to
matter), we will reject the null of the TOST and conclude equivalence with power
increasing to unity with larger sample sizes.

If $\delta > \tau$, then the power of the significance test approaches unity and the power of
the equivalence test approaches zero as the sample size n increases. This is the
situation where the effect size parameter exceeds the specified maximum for practical
importance; we will reject the null of the t-test and conclude statistical significance with power
increasing to unity as the sample size increases.

When $\delta = \tau$, then the power of the equivalence test will approach twice the nominal
alpha level (e.g. $2\alpha = 2 \times 0.05 = 0.10$). This occurs because the effect size parameter
happens to coincide with the specified equivalence bound. Rejecting the TOST (i.e.
concluding equivalence) is a type I error, made with probability $2\alpha$. The probability is
twice the nominal $\alpha$ since an equivalence test at level $\alpha$ corresponds to a $100(1 - 2\alpha)\%$
equivalence interval.

When $0 < \delta < \tau$, then the power of both the significance and equivalence tests
approaches unity (often slowly) as n increases. This is the situation where the null
hypothesis of a significance test is false (i.e. the difference of means is not equal to zero),
but the true difference is too small to be considered practically significant, where $\tau$ is the
minimum difference between means that is considered “important”.

It appears to be somewhat common with “real” data to have situations where the
null hypotheses of the tests of statistical significance and equivalence are simultaneously
rejected for reasonable choices of significance level $\alpha$ and equivalence bound $\tau$. Our re-analysis of
the Benson (1989) data yielded 4 simultaneous rejections out of 7 variables. Rogers et al.
(1993) obtained 1 simultaneous rejection out of 13 when re-analyzing data from Cannon,
Bell, Fowler, Penk, and Finkelstein (1990) concerning MMPI scores for alcohol versus
drug-dependent subjects and 1 simultaneous rejection out of 27 from the study of Zabin,
Hirsch, and Emerson (1990) concerning differences between pregnant adolescent
females who elect abortion versus those who carry the baby to term.

The simulated power of simultaneous rejection found in our limited simulation study,
as shown in Figures 6 and 7, showed that the probability of simultaneous rejection was
low when the assumptions of the inferential tests (i.e. normality, equal variances, equal
sample sizes between groups) were met, except when both n and $\tau$ were large. It is
possible that “simultaneous rejection” will be more likely with real data than (at least our) 
simulated data because real data will surely violate the normality and homoscedasticity 
assumptions. We speculate simultaneous rejection will be more common, and thus 
potentially more problematic for the researcher using equivalence testing in conjunction 
with standard hypothesis testing, when the data is non-normal and heteroscedastic. 

We find the magnitude of effect sizes obtained from the statistical re-analysis of 
typical educational research data to be troubling. Benson’s data was of a decent size 
(groups of 94 and 123 subjects), but an effect size as large as 0.51 yielded both statistical 
significance (rejecting that the male mean was equal to the female mean) and 
equivalence (rejecting that the absolute difference of the male and female means was
at least the equivalence bound $\tau$). We make the tentative conjecture that the effect size conventions of
Cohen (i.e. 0.2 is small, 0.5 is medium, 0.8 is large) might not be large enough. It is even
possible that making any recommendation about the desired magnitude of an effect size
independent of the sample sizes and variability of the populations might be futile (Lenth,
2001).

It would be desirable to extend the simulation study to consider several scenarios 
ignored here. In particular, more attention needs to be given to situations where one or
more of the following conditions are true:

1. the populations are non-normal;

2. the variances are not equal;

3. the sample sizes of the groups are not equal.

It would also be desirable to analytically determine the power function for simultaneous
rejection of the significance and equivalence tests, if possible. We will continue to strive 
for a greater understanding of the link between the effect size and the results of the 
significance and equivalence tests.

Table 1

Possible Combinations of Significance and Equivalence Testing

Significance Test    Equivalence Test    Term

Fail to Reject       Reject              Equivalent
Reject               Reject              Equivalent and Different
Reject               Fail to Reject      Different
Fail to Reject       Fail to Reject      Equivocal

Table 2

Comparing Significance and Equivalence Testing for the Benson Data

                           Males (N = 94)    Females (N = 123)   Effect    Sig.       Equiv.
Variable                   M        SD       M        SD         Size      p-value    p-value    Category

GPA                        3.05     0.44     3.16     0.47       -0.24     0.040      < 0.001    Equiv. & Diff.
Prior Math Courses         3.45     2.14     2.20     2.01        0.60     < 0.001    0.998      Different
Math Self-concept          25.77    5.96     23.20    7.05        0.39     0.002      0.012      Equiv. & Diff.
Self-efficacy              12.68    1.77     11.62    2.30        0.51     < 0.001    < 0.001    Equiv. & Diff.
General Test Anxiety       36.38    10.49    40.62    12.25      -0.37     0.004      0.007      Equiv. & Diff.
Achievement                32.56    5.68     32.26    7.55        0.04     0.374      < 0.001    Equivalent
Statistical Test Anxiety   32.65    12.57    41.84    14.83      -0.66     < 0.001    0.663      Different

Table 3

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       506          9494
               20               0             0       500          9500
               50               0             0       476          9524
               100              0             0       535          9465
               200              0             0       504          9496
               500              2337          0       511          7152
0.2            10               0             0       496          9504
               20               0             0       507          9493
               50               0             0       485          9515
               100              1063          0       546          8391
               200              5121          0       514          4365
               500              9386          3       490          121
0.4            10               10            0       486          9504
               20               370           0       469          9161
               50               5279          0       481          4240
               100              8757          0       457          786
               200              9493          444     63           0
               500              9483          517     0            0

Table 4

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0.1$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       535          9465
               20               0             0       606          9394
               50               0             0       817          9183
               100              0             0       1118         8882
               200              0             0       1652         8348
               500              0             709     3366         5925
0.2            10               0             0       521          9479
               20               0             0       605          9395
               50               1             0       786          9213
               100              793           0       1090         8117
               200              3452          0       1687         4861
               500              6192          15      3486         307
0.4            10               11            0       565          9424
               20               347           0       622          9031
               50               4759          0       772          4469
               100              7902          0       1044         1054
               200              8361          1196    443          0
               500              6521          3475    4            0

Table 5

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0.2$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       727          9273
               20               0             0       962          9038
               50               0             0       1727         8273
               100              0             0       2865         7135
               200              0             0       5193         4807
               500              16            0       8880         1104
0.2            10               0             0       699          9301
               20               0             0       950          9050
               50               0             0       1678         8322
               100              408           0       2908         6684
               200              951           0       5207         3842
               500              915           7       8924         154
0.4            10               8             0       734          9258
               20               296           0       967          8737
               50               3397          0       1677         4926
               100              5485          0       2890         1625
               200              4886          2800    2314         0
               500              1167          8534    299          0

Table 6

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0.3$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       947          9053
               20               0             0       1540         8460
               50               0             0       3144         6856
               100              0             0       5594         4406
               200              0             0       8482         1518
               500              0             0       9973         27
0.2            10               0             0       985          9015
               20               0             0       1501         8499
               50               0             0       3203         6797
               100              104           0       5681         4215
               200              95            0       8524         1381
               500              19            1       9973         7
0.4            10               11            0       991          8998
               20               225           0       1563         8212
               50               2061          0       3133         4806
               100              2796          0       5602         1602
               200              1516          2374    6110         2167
               500              23            6115    3862         0

Table 7

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0.4$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       1335         8665
               20               0             0       2333         7667
               50               0             0       5015         4985
               100              0             0       8069         1931
               200              0             0       9769         231
               500              0             0       10000        0
0.2            10               0             0       1344         8656
               20               0             0       2341         7659
               50               0             0       5077         4923
               100              23            0       8110         1867
               200              1             0       9784         215
               500              0             0       10000        0
0.4            10               9             0       1402         8589
               20               164           0       2346         7490
               50               933           0       5099         3968
               100              932           0       8075         993
               200              232           806     8962         0
               500              0             1025    8975         0

Table 8

Simulated Power of the Tests of Statistical Equivalence and Significance, Effect Size $\delta = 0.5$

Equivalence    Sample Size      Number of Rejections (10000 Simulations)
Bound $\tau$   (per group) N    Equivalent    Both    Different    Neither

0.1            10               0             0       1897         8103
               20               0             0       3383         6617
               50               0             0       6981         3019
               100              0             0       9428         572
               200              0             0       9985         15
               500              0             0       10000        0
0.2            10               0             0       1804         8196
               20               0             0       3437         6563
               50               0             0       6905         3095
               100              1             0       9429         570
               200              0             0       9987         13
               500              0             0       10000        0
0.4            10               7             0       1866         8127
               20               117           0       3425         6458
               50               370           0       6938         2692
               100              236           0       9378         386
               200              13            108     9879         0
               500              0             28      9972         0

Figure Captions

Figure 1. Power of the Tests of Significance and Equivalence with N = 200 and $\tau = 0.2$.

Figure 2. Power of the Tests of Significance and Equivalence with N = 200 and $\tau = 0.4$.

Figure 3. Power of the Tests of Significance and Equivalence with $\delta = \tau = 0.2$.

Figure 4. Power of the Tests of Significance and Equivalence with $\delta = \tau = 0.4$.

Figure 5. Power of the Tests of Significance and Equivalence with $\delta = 0.2$ and $\tau = 0.4$.

Figure 6. Simultaneous Power of the Tests of Significance and Equivalence with $\delta = 0$.

Figure 7. Simultaneous Power of the Tests of Significance and Equivalence with $\delta = 0.2$.

Equivalence Testing with Hypothesis Testing, Figure 1